The distribution of the mean of a random sample from a population with finite variance is approximately normal when the sample size is large, regardless of the shape of the population's distribution. In other words, the distribution of the means of random samples looks like a normal distribution, so we can approximate the sampling distribution of the mean with a normal distribution even when the population itself is not normally distributed.
Because of the CLT, it's possible to make probabilistic inferences about population parameter values based on sample statistics.
As the sample size approaches infinity, the center of the distribution of the sample means becomes very close to the population mean. According to the law of large numbers, the average of the results obtained from a large number of trials should be close to the expected value, and it tends to get closer as more trials are performed.
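A quick simulation makes both ideas concrete. The snippet below is a minimal sketch (the exponential population and all specific numbers are my own illustrative choices): sample means from a skewed distribution cluster around the population mean, and a running average converges as the number of trials grows.

```python
# Minimal sketch: simulate the CLT and the law of large numbers by drawing
# samples from a clearly non-normal (exponential) population with mean 1.0.
import numpy as np

rng = np.random.default_rng(0)

# CLT: the distribution of sample means looks roughly normal and is centered
# near the population mean, even though the population is skewed.
sample_means = rng.exponential(scale=1.0, size=(10_000, 50)).mean(axis=1)
print(sample_means.mean(), sample_means.std())

# Law of large numbers: the running average of one long sequence of draws
# converges toward the expected value (1.0) as more trials accumulate.
draws = rng.exponential(scale=1.0, size=100_000)
running_average = draws.cumsum() / np.arange(1, draws.size + 1)
print(running_average[[99, 9_999, 99_999]])
```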
As we add dimensions, we increase the processing power needed to analyze the data and the amount of training data required to build meaningful models. As the number of features increases, the classifier's performance improves until an optimal number of features is reached; adding more features while keeping the amount of data the same then degrades the classifier's performance.
KNN is very susceptible to overfitting due to the curse of dimensionality. The curse of dimensionality also describes the phenomenon where the feature space becomes increasingly sparse as the number of dimensions grows for a fixed-size training dataset. Intuitively, even the closest neighbors can be too far away in a high-dimensional space to give a good estimate. As our data becomes increasingly sparse, we risk overfitting and performing poorly on a test set.
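The sparsity effect is easy to see numerically. This is a minimal sketch (uniform random points, arbitrary sizes) showing that with a fixed number of points, the average distance to the nearest neighbor grows with dimensionality, so KNN's "neighbors" become less and less local.

```python
# Minimal sketch: nearest-neighbor distances grow as dimensionality grows
# while the number of points stays fixed.
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
n_points = 500

for n_dims in (2, 10, 100, 1000):
    X = rng.uniform(size=(n_points, n_dims))
    dists = cdist(X, X)               # pairwise Euclidean distances
    np.fill_diagonal(dists, np.inf)   # ignore the zero distance to self
    # Average distance to the nearest neighbor rises with dimensionality.
    print(n_dims, round(dists.min(axis=1).mean(), 3))
```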
In essence, an eigenvector v of a linear transformation T is a non-zero vector that, when T is applied to it, does not change direction. Applying T to the eigenvector only scales the eigenvector by the scalar value λ, called an eigenvalue.
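A quick numerical check of the definition (the matrix `A` below is an arbitrary example I chose for illustration):

```python
# Minimal sketch: verify that A @ v equals lambda * v for an eigenpair
# returned by NumPy.
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)

v = eigenvectors[:, 0]   # first eigenvector (a column of the result)
lam = eigenvalues[0]     # its eigenvalue
print(A @ v)             # same direction as v ...
print(lam * v)           # ... only scaled by lambda
```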
A more general version of eigenvalue decomposition, since eigendecomposition can only be applied to a diagonalizable square matrix, while this factorization exists for any m × n matrix.
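Assuming this refers to the singular value decomposition (SVD), here is a minimal NumPy sketch of the factorization applied to a rectangular (non-square) matrix:

```python
# Minimal sketch: SVD works for any rectangular matrix, unlike
# eigendecomposition.
import numpy as np

A = np.array([[3.0, 1.0, 2.0],
              [0.0, 2.0, 1.0]])            # 2 x 3, not even square
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Reconstruct A from the factorization: A = U @ diag(s) @ Vt
print(np.allclose(A, U @ np.diag(s) @ Vt))  # True
```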
Used to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.
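A minimal sketch with scikit-learn's `LinearDiscriminantAnalysis` (an assumed library choice, demonstrated on the Iris dataset), showing both uses: dimensionality reduction and classification.

```python
# Minimal sketch: LDA as a supervised dimensionality-reduction step and as a
# linear classifier.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis(n_components=2)
X_reduced = lda.fit_transform(X, y)   # project onto 2 discriminant axes
print(X_reduced.shape)                # (150, 2)
print(lda.score(X, y))                # accuracy when used as a classifier
```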
Probability is the area under the curve of a distribution between two values. Imagine a normal distribution of mouse weights; we might write the probability as p(weight between 32 and 34 grams | mean=32 and std=2.5) = 0.29. As we change the range of weights we find new probabilities: the right side of the expression (the distribution's parameters) stays fixed, while the left side describes the range of values whose area under the curve we are measuring.
Likelihood assumes we have already observed some data. If we have a 34 gram mouse, we can look at the y-axis of our distribution (the height of the curve at 34 grams) to get the likelihood of the left side of the expression, the parameters of the distribution, written as L(mean=32 and std=2.5 | mouse weighs 34 grams). The right side stays fixed because it is defined by our observed data, and we vary the left side to get the likelihood that those parameters define the distribution our data came from.
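As a rough sketch of the same mouse-weight example (the mean of 32 and standard deviation of 2.5 come from the notes above; `scipy.stats.norm` is my assumed tool), the difference is visible directly in code:

```python
# Minimal sketch: probability is an area under the curve, likelihood is the
# height of the curve at an observed value.
from scipy.stats import norm

dist = norm(loc=32, scale=2.5)

# Probability: area under the curve between two weights.
p = dist.cdf(34) - dist.cdf(32)
print(p)          # roughly 0.29

# Likelihood: height of the curve at the observed weight of 34 grams,
# evaluated for these particular parameter values.
L = dist.pdf(34)
print(L)
```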
MLE, or maximum likelihood estimation, finds the parameters of a normal distribution that maximize the likelihood of our observed data, i.e. the parameter values most likely to have produced it. MLE is an example of statistical inference; there are other approaches, such as Bayesian inference.
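A minimal sketch of MLE for the same kind of normal model, using scipy's built-in fitter on made-up mouse weights (the data values are purely illustrative):

```python
# Minimal sketch: norm.fit returns the maximum likelihood estimates of the
# mean and standard deviation for the observed data.
import numpy as np
from scipy.stats import norm

weights = np.array([30.1, 31.8, 32.4, 33.0, 34.2, 31.5, 32.9])

mu_hat, sigma_hat = norm.fit(weights)
print(mu_hat, sigma_hat)
```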
The difference is the penalty term. Ridge adds the squared magnitude of the coefficients (an L2 penalty) to the loss, while lasso adds the absolute value of the coefficients (an L1 penalty).
Lasso penalizes the sum of the absolute values; as a result, for high values of lambda, many coefficients are zeroed under lasso, which is never the case in ridge regression. Lasso tends to do well when a small number of predictors are significant and the rest are close to zero. Ridge tends to do better when there are many parameters of about the same size. Ultimately we should run cross-validation and choose whichever performs better.
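A minimal sketch with scikit-learn (the data is synthetic, only a few features truly matter, and `alpha` plays the role of lambda; the specific values are illustrative, not prescriptive):

```python
# Minimal sketch: lasso zeroes many coefficients, ridge only shrinks them.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
true_coef = np.zeros(20)
true_coef[:3] = [4.0, -3.0, 2.0]           # only 3 significant predictors
y = X @ true_coef + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print((np.abs(ridge.coef_) < 1e-6).sum())  # ridge rarely hits exactly zero
print((np.abs(lasso.coef_) < 1e-6).sum())  # lasso zeroes many coefficients
```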
The standard deviation of the sampling distribution of the sample mean.
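For a concrete sample, this is estimated as the sample standard deviation divided by the square root of n; a minimal sketch with made-up numbers:

```python
# Minimal sketch: standard error of the mean, by hand and via scipy.
import numpy as np
from scipy.stats import sem

x = np.array([2.3, 1.9, 3.1, 2.8, 2.5, 3.0])
print(np.std(x, ddof=1) / np.sqrt(len(x)))  # s / sqrt(n)
print(sem(x))                               # same value
```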
The AUC-ROC curve is used in binary classification to tell us how well the model can distinguish between classes; the higher the AUC, the better the model is at predicting the correct class. The ROC curve is built from the TPR (true positive rate, also called recall or sensitivity) and the FPR (false positive rate, which equals 1 − specificity). We can use AUC-ROC for multi-class classification by drawing a one-vs-rest curve for each class and averaging with equal weight (macro-averaging), or by drawing one curve that treats each element as a binary prediction (micro-averaging).
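A minimal sketch with scikit-learn on synthetic data (the classifier and dataset parameters are arbitrary choices): binary AUC, then macro- and micro-averaged one-vs-rest AUC for a three-class problem.

```python
# Minimal sketch: binary AUC, then macro- and micro-averaged multi-class AUC.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Binary classification
X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))

# Multi-class classification with one-vs-rest curves
X, y = make_classification(n_samples=500, n_classes=3, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)
print(roc_auc_score(y_te, proba, multi_class="ovr", average="macro"))
# Micro-averaging treats every (sample, class) pair as one binary prediction.
y_te_bin = label_binarize(y_te, classes=[0, 1, 2])
print(roc_auc_score(y_te_bin, proba, average="micro"))
```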
In a z-test, the sample is assumed to be normally distributed. A z-score is calculated with population parameters such as “population mean” and “population standard deviation” and is used to validate a hypothesis that the sample drawn belongs to the same population.
$H_o$: The sample mean is the same as the population mean
$H_a$: The sample mean is not the same as the population mean
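Assuming the population mean and standard deviation are known (all numbers below are made up), a two-sided one-sample z-test can be sketched as:

```python
# Minimal sketch: one-sample z-test against known population parameters.
import numpy as np
from scipy.stats import norm

pop_mean, pop_std = 100.0, 15.0
sample = np.array([104, 98, 110, 107, 101, 112, 99, 108], dtype=float)

z = (sample.mean() - pop_mean) / (pop_std / np.sqrt(len(sample)))
p_value = 2 * norm.sf(abs(z))          # two-sided p-value
print(z, p_value)                      # reject H_o if p_value < 0.05
```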
Compares two averages and tells you if they are different from each other; the t-test also tells you how significant the differences are. Use it to compare the means of two groups to figure out the probability that their differences are the result of chance. A t-test is used when the population parameters (mean and standard deviation) are not known. The t score is the ratio of the difference between the two groups to the difference within the groups.
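A minimal sketch of an independent two-sample t-test with scipy (the group values are made up):

```python
# Minimal sketch: independent two-sample t-test.
from scipy.stats import ttest_ind

group_a = [23.1, 25.4, 24.8, 22.9, 26.0, 24.3]
group_b = [27.2, 26.8, 28.1, 25.9, 27.5, 26.4]

t_stat, p_value = ttest_ind(group_a, group_b)
print(t_stat, p_value)   # small p-value -> means likely differ beyond chance
```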
The chi-square test is used to compare categorical variables. There are two types of chi-square test: the goodness-of-fit test and the test of independence.
a. A small chi-square value means the observed data fit the expected distribution well.
b. A large chi-square value means the observed data don't fit the expected distribution well.
$H_o$: Variable A and Variable B are independent
$H_a$: Variable A and Variable B are not independent.
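A minimal sketch of the test of independence with scipy, on a made-up 2x2 contingency table of counts for Variable A vs. Variable B:

```python
# Minimal sketch: chi-square test of independence on a contingency table.
from scipy.stats import chi2_contingency

table = [[30, 10],
         [20, 40]]
chi2, p_value, dof, expected = chi2_contingency(table)
print(chi2, p_value)   # small p-value -> reject independence
```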
Also known as analysis of variance, it is used to compare multiple (three or more) samples with a single test. There are two major flavors of ANOVA: one-way and two-way.
$H_o$: All pairs of samples are the same, i.e. all sample means are equal
$H_a$: At least one pair of samples is significantly different
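A minimal sketch of a one-way ANOVA across three made-up groups using scipy:

```python
# Minimal sketch: one-way ANOVA comparing three group means with one test.
from scipy.stats import f_oneway

group_1 = [18.2, 20.1, 19.5, 21.3, 20.8]
group_2 = [22.4, 23.1, 21.9, 24.0, 22.7]
group_3 = [19.0, 18.5, 20.2, 19.8, 18.9]

f_stat, p_value = f_oneway(group_1, group_2, group_3)
print(f_stat, p_value)   # small p-value -> at least one group mean differs
```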
Based on collecting and analyzing a large amount of information on users' behaviors, activities, or preferences, and predicting what users will like based on their similarity to other users. A key advantage of the collaborative filtering approach is that it doesn't require understanding the characteristics of the items themselves.
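As a rough illustration only (the ratings matrix, the cosine-similarity weighting, and the zero-for-unrated convention are my own simplifications, not a production recipe), user-based collaborative filtering can be sketched as:

```python
# Minimal sketch: predict a user's rating for an unseen item from the ratings
# of similar users (user-based collaborative filtering).
import numpy as np

# Rows = users, columns = items; 0 means "not rated yet" (naive convention).
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

target_user, target_item = 0, 2          # predict user 0's rating for item 2
others = [u for u in range(len(ratings)) if u != target_user]
sims = np.array([cosine(ratings[target_user], ratings[u]) for u in others])

# Weighted average of the other users' ratings for the target item.
predicted = sims @ ratings[others, target_item] / sims.sum()
print(predicted)
```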
Based on the description of an item and a profile of the user's preferences. In content-based recommender systems, you need a profile of each item, as in Pandora, where a single seed song is used to find other songs with similar musical characteristics; as the user interacts with Pandora, the system builds a profile of the user.
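A rough sketch of the same idea (the attribute vectors, song names, and averaging scheme are all made up for illustration): each item has a feature profile, the user profile is built from liked items, and new items are ranked by similarity to that profile.

```python
# Minimal sketch: content-based scoring against a user profile built from a
# single seed item.
import numpy as np

item_features = {
    "song_a": np.array([0.9, 0.1, 0.3]),   # made-up attribute vectors
    "song_b": np.array([0.8, 0.2, 0.4]),
    "song_c": np.array([0.1, 0.9, 0.7]),
}
liked = ["song_a"]                          # the seed the user started from

user_profile = np.mean([item_features[s] for s in liked], axis=0)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

scores = {name: cosine(user_profile, feats)
          for name, feats in item_features.items() if name not in liked}
print(max(scores, key=scores.get))          # song_b is most similar to the seed
```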